1. Introduction

Description of the Project

This project uses the World Bank’s Education Statistics dataset to explore long-term trends and relationships among key indicators of the US education system. Our goal is to understand how structural aspects of education (access, staffing, and outcomes) have evolved over time and how changes in one variable may influence changes in another. We analyze variables across different categories and examine differences between sexes. By focusing on longitudinal, cross-national data, we aim to identify potential patterns and meaningful interactions that could inform educational policy and investment.

The analysis of various indicators allows us to examine temporal changes and potential external factors that might shape educational outcomes.
We selected two major indicator categories that reflect different dimensions of the education system:

  1. Enrollment Rate vs. Attendance Rate
    We will compare enrollment and attendance rates across different levels of schooling (primary, secondary, tertiary) to test whether higher enrollment leads to consistent attendance over time.

  2. Number of Teachers vs. Teaching Staff Compensation
    We will investigate how changes in staffing levels correspond with compensation across educational stages.

Motivation

When we were brainstorming datasets to go with, we had difficulty choosing a topic because the possibilities were endless. We looked at several options from public health to climate to economics, but we eventually narrowed it down to education. This is something that all of us are connected to and that our classmates and readers can easily relate to. We initially considered focusing on Emory-specific data but broadened our scope to the World Bank’s Education Statistics for its larger context and long-term coverage.

As students at an American university, we were especially interested in examining data trends within the United States. Education policy and access here have evolved significantly over time, and exploring long-term patterns in enrollment, attendance, and staffing can reveal whether increased investment in education translates into better participation and equity. This directly relates to us, as college students who have undergone the three main levels of education and were told that our own efforts, most notably in high school, would determine our outcomes later in life. But how much do individual outcomes actually depend on the student, and how much on the school systems and structures that support them?


#what is this meant to contain # 2. Dataset - Data Introduction: source, scope, any ethical considerations - Explain different variables


3. Setup & Packages

We will perform Exploratory Data Analysis (EDA) in R using the tidyverse suite of libraries along with some supplementary packages for smoother visualization.
Our analysis will include the following components:

The “r setup” code chunk applies to every code chunk in the file, so this will allow us to create a “global environment” to use in our analysis. First, we install and load all necessary packages, then set global options for reproducibility and consistent output formatting. We also create functions that will streamline table formatting for data any outputs of interest.

# install.packages("kableExtra")
# install.packages("tidyverse")
# install.packages("readr")
# install.packages("knitr")
# install.packages("plotly")
# install.packages("rmarkdown")
# install.packages("broom")
# install.packages("ggpubr")
# install.packages("janitor")
library(kableExtra)
library(tidyverse) # core data manipulation & visualization
library(readr) # efficient CSV import
library(knitr)
library(plotly) # interactive data visualization
library(janitor) # column name cleaning & simple tabulations
library(lubridate) # data parsing and time handling
library(rmarkdown)
library(broom)
library(ggpubr)

#functions to format dts
pretty_r2_table <- function(df, caption_text){
  df %>%
    kable("html", caption = caption_text) %>%
    kable_styling(full_width = F) %>%
    row_spec(0, bold = TRUE) %>%
    column_spec(3, color = ifelse(df$p.value < 0.05, "red", "black"))
}

pretty_cor_table <- function(x, y, var1_name = NULL, var2_name = NULL) {
  cor_res <- cor.test(x, y)
  tibble(
    Variable1 = var1_name,
    Variable2 = var2_name,
    Correlation = round(cor_res$estimate, 3),
    p.value = signif(cor_res$p.value, 3)) %>%
    kable("html", caption = paste0("Correlation: ", var1_name, " vs ", var2_name)) %>%
    kable_styling(full_width = F) %>%
    row_spec(0, bold = TRUE) %>%
    column_spec(4, color = ifelse(cor_res$p.value < 0.05, "red", "black"))
}

set.seed(123)
opts_chunk$set(echo=TRUE, message=FALSE, warning=FALSE)

4. Data Import & Cleaning

This dataset is very large and contains data from various countries. As previously mentioned, we are focusing only on the US data. We start from reading the raw EdStats dataset (EdStats_v01.csv) and filter only the United States(Country code == "USA"). We also drop fully empty columns (2024 column) as they will not be needed in our analysis. As a check, print any columns that contain all NA values and proceed only if the null set is output. Then, we write the filtered dataset to EdStats_USA.csv so that later chunks can work only with our US data without filtering for it multiple times.

This simplifies the dataset to the country of interest and removes all-NA columns to keep downstream transformations clean.

#set up data for use
#read data, skip empty 2024 column
data <- read_csv("EdStats_v01.csv", show_col_types=FALSE, col_types=cols(`2024`=col_skip()))

# filter for only USA data
data_clean <- data %>%
  filter(grepl("USA", `Country code`))

#make sure no NA cols remain
if (length(colnames(data_clean)[colSums(is.na(data_clean)) == nrow(data_clean)]) == 0) {
  message("No NA cols remain.")
} else {
  message("NA columns remain.")
}
## No NA cols remain.
#save filtered dataset
write_csv(data_clean, "EdStats_USA.csv")

To keep downstream steps tidy and reproducible, we create two thematic subsets aligned to our research questions:

  1. Enrollment vs. Attendance — indicators with “total net enrolment rate” or “total net attendance rate” (exclude parity indices and sex-specific series for apples-to-apples totals).
    • Note that “enrolment” is misspelled, this is corrected later in code.
    • Dropped all-NA columns and saved as EdStats_attend.csv.
  2. Teachers vs. Compensation — teacher counts and teaching staff compensation as a percentage of total institution spending (exclude sex splits and qualification percentages).
    • Dropped all-NA columns and saved as EdStats_teacher.csv.

This ensures we work with clean, directly comparable variables.

# 1. Enrollment vs. Attendance
# filter for enrollment/attendance info, select only sex-combined data
data <- read_csv("EdStats_USA.csv", show_col_types = FALSE)
data_filter_attend <- data %>%
  filter(grepl("total net enrolment rate|total net attendance rate", `Indicator name`, ignore.case = TRUE)) %>%
  filter(!grepl("female|male", `Indicator name`)) %>%
  filter(!grepl("adjusted gender parity index", `Indicator name`, ignore.case = TRUE)) %>%
  select(where(~!all(is.na(.)))) %>% #view() %>%
write_csv("EdStats_attend.csv")

# 2. Teachers vs. Compensation
#filter for num teachers and teaching staff compensation info, select only sex-combined data
data <- read_csv("EdStats_USA.csv", show_col_types = FALSE)
data_filter_teacher <- data %>%
  filter(grepl("teachers in|teaching staff compensation", `Indicator name`, ignore.case = TRUE)) %>%
  filter(!grepl("female|male", `Indicator name`)) %>%
  filter(!grepl("percentage of qualified", `Indicator name`, ignore.case = TRUE)) %>%
  select(where(~!all(is.na(.)))) %>% #view() %>%
write_csv("EdStats_teacher.csv")

5. Attendance vs. Enrollment — Reshape, Compare, and Quantify

The following chunk expands our analysis of EdStats_attend.csv to quantify and visualize the relationship between school enrollment and attendance across education levels in the United States.
After reshaping the dataset into tidy long form, we standardize indicator names to capture both spellings of “enrolment/enrollment” and collapse duplicate series to obtain a single observation for each year × level × type combination. The long format allows us to have columns for country name and country code (only really needed for reproducibility), variable name, variable value, year, type (simplified variable name: attendance/enrollment), and level (renamed: primary school to elementary school, lower secondary to middle school, and upper secondary to high school). Previously, the “indicator column” contained school type, school level, and sex all in one string. With this new format, we can plot variables in ggplot with column names.

#Attendacnce gap data need to be fixed i will do this if i have time i might have fucked with it From these, we compute a new variable representing the attendance gap (Enrollment – Attendance) to measure how much access (enrollment) translates into sustained participation (attendance).

Reading tip:
Focus on whether attendance consistently trails enrollment and whether that gap changes across levels or decades. High correlation indicates structural progress—both metrics rising together—while a persistent positive gap suggests that access alone does not guarantee participation.

data_attend <- read_csv("EdStats_attend.csv", show_col_types = FALSE)

#long format
data_long_attend <- data_attend %>%
  pivot_longer(cols=`1975`:`2022`, names_to="Year", values_to="Value") %>%
  mutate(
    Year = as.numeric(Year),
    Type = case_when(
      grepl("attendance", `Indicator name`, ignore.case=TRUE) ~ "Attendance",
      grepl("enrolment", `Indicator name`, ignore.case=TRUE) ~ "Enrollment",
      TRUE ~ "Other"
    ),
    Level = case_when(
      grepl("primary", `Indicator name`, ignore.case=TRUE) ~ "Elementary School",
      grepl("lower secondary", `Indicator name`, ignore.case=TRUE) ~ "Middle School",
      grepl("upper secondary", `Indicator name`, ignore.case=TRUE) ~ "High School",
      TRUE ~ "Other"
    )
  ) %>%
  filter(Type != "Other", !is.na(Value)) %>%
  mutate(
    Level = factor(Level, levels = c("Elementary School", "Middle School", "High School")),
    Type = factor(Type, levels = c("Attendance", "Enrollment"))
  )
write.csv(data_long_attend, "data_long_attend.csv")


#distribution plot options
#violin plot (too messy i think unfortunately)
ggplot(data_long_attend, aes(x=factor(Year, levels=sort(unique(Year))), y=Value, fill=Type)) +
  geom_violin(alpha=0.5, trim=FALSE) +
  theme_minimal() +
  labs(
    title="Distribution of Attendance vs Enrollment Rates by Year (Pre-filtering)",
    x="Year",
    y="Rate (%)",
    fill="Type"
  )+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#scatter plot. i added a little bit of jitter cuz there were some overlapping points
ggplot(data_long_attend, aes(x=factor(Year), y=Value, color=Type)) +
  geom_jitter(width=0.1, alpha=0.7, size=3) +
  theme_minimal() +
  labs(
    title="Distribution of Attendance vs Enrollment Rates by Year (Pre-filtering)",
    x="Year",
    y="Rate (%)",
    color="Type"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#boxplot and jittered data. too cluttered i think. 
ggplot(data_long_attend, aes(x=factor(Year), y=Value, fill=Type)) +
  geom_boxplot(alpha=0.3, outlier.shape = 16, outlier.color="red") +
  geom_jitter(aes(color=Type), width=0.1, size=2) +
  theme_minimal() +
  labs(title="Distribution of Attendance vs Enrollment Rates by Year (Pre-filtering)", x="Year", y="Rate (%)") +
  theme(axis.text.x = element_text(angle=45, hjust=1))

#just boxplot. i think still too cluttered. we have no outliers anyways
ggplot(data_long_attend, aes(x=factor(Year), y=Value, fill=Type)) +
  geom_boxplot(alpha=0.5, outlier.shape = 16, outlier.color="red") +
  theme_minimal() +
  labs(title="Distribution of Attendance vs Enrollment Rates by Year (Pre-filtering)", x="Year", y="Rate (%)") +
  theme(axis.text.x = element_text(angle=45, hjust=1))

The above plot(s) show the distribution of the data before we focus in on the years with data for both variables. As can be seen in the plot(s), the earlier years contain only enrollment data, while attendance data is added around 2005. Ggplot uses interquartile range (IQR) to calculate outliers, which works for out dual-variable analysis and we do not need to use any multivariable outlier calculation techniques. Any outliers would be marked with red dots in the box plot, and seeing as there are none, we do not need to remove any outliers.

We have data in long format, and next we make two more dataframes: one in wider format and one in a combined format. The wider dataframe is only wider in comparison to the long df, not the raw data. It separates the “type” column (attendance and enrollment rates) from the long dataframe into two columns, one for enrollment rate and one for attendance rate. However, this creates NA values the rates are not lined up by year and level. Thus, we create a final data combined dataframe that addresses this issue and lines up year, level, attendance, and enrollment. The data combined contains only the values have data for both attendance and enrollment in the same year and level, which leaves us with substantially less data than we started with. However, this is necessary for accurate plotting and calculations.

data_attend <- read_csv("data_long_attend.csv", show_col_types = FALSE)

#select years with most observations
data_attend_years <- data_attend %>%
  filter(!Year %in% c(1975, 1986, 1987, 1990, 1991, 1993, 1994, 1995, 1996, 1999))
write_csv(data_attend_years, "data_long_attend.csv")

#wide format
data_wide_attend <- data_attend_years %>%
  pivot_wider(
    names_from=Type,
    values_from=Value
  ) #%>% view()

#combined format
data_combined_attend <- data_wide_attend %>%
  group_by(Level, Year) %>%
  summarise(
    Attendance = Attendance[!is.na(Attendance)],
    Enrollment = Enrollment[!is.na(Enrollment)]
  ) %>%
  ungroup() %>%
  mutate(Level = factor(Level, levels = c("Elementary School", "Middle School", "High School"))) #%>% view()
write_csv(data_combined_attend, "data_combined_attend.csv")

#add selected distribution plot to show new distribution

We now look at the trends in the data. Using our long and combined datasets, we can build visuals to display how attendance and enrollment rates change as time progresses. We refactor the data after loading it in so that the variables are ordered correctly in the plots, and then build the plots themselves.

The following three plots presents annual attendance and enrollment rates for US elementary, middle, and high schools. Each color represents a different variable as can be seen in the legend. Each panel shows how these two metrics evolve side by side within a given school level. Enrollment rates include more fluctuation than attendance rates for all three levels of schools, especially in high schools. Please note that attendance data is available one year later than enrollment data.

Readers can hover over the plots to see specific data point information, zoom in or out, or double click a variable on the legend to see only the data for the variable.

data_long <- read_csv("data_long_attend.csv", show_col_types = FALSE) %>%
  mutate(
    Level = factor(Level, levels = c("Elementary School", "Middle School", "High School")),
    Type = factor(Type, levels = c("Attendance", "Enrollment"))
  )

p_line_per_school_horizontal <- ggplot(data_long, aes(x=Year, y=Value, color=Type)) +
  geom_line(size=1.2, na.rm=TRUE) +
  geom_point(size=2, alpha=0.8) +
  facet_grid(~Level, scales="fixed", switch="y") +
  theme_minimal() +
  theme(
    strip.background=element_rect(fill="grey90", color=NA),
    panel.spacing=unit(1, "lines")
  ) +
  labs(
    title="Attendance vs. Enrollment Trends in Schools Over Time (USA)",
    y="Rate (%)",
    color="Variable"
  ) +
  scale_color_manual(values=c("Attendance"="#FF69B4", "Enrollment"="#3bbf8f")
  )
# ggsave("line_per_school_horizontal.png", p_line_per_school_horizontal)

p_line_interactive1 <- ggplotly(p_line_per_school_horizontal)
p_line_interactive1

Next, we visualized how enrollment trends changed over time across the three different school levels Each color represents the variable shown in the legend, each line represents the linear trend, and shaded regions show 95% confidence intervals. We see a steady increase in high schools, which may suggest possible demographic shifts or improved retention in upper grades. In contrast, enrollment in elementary schools slightly decreased over time, indicating declining birth cohorts feeding into early education. Middle school enrollments remained relatively stable. It is important to note the COVID quarantine started in early 2020, and both Elementary and Middle school enrollment was lowest the following year.

Readers can hover over the plot to see specific data point information, zoom in or out, or double click a variable on the legend to see only the data for the variable.

data_combined <- read_csv("data_combined_attend.csv", show_col_types = FALSE) %>%
  mutate(
    Level = factor(Level, levels = c("Elementary School", "Middle School", "High School"))
  )

enroll_longitudinal <- ggplot(data_combined, aes(x=Year, y=Enrollment, color=Level)) +
  geom_point(size=2) +
  geom_smooth(method=lm, size=1, alpha=0.25) +
  theme_minimal() +
  labs(title=" Enrollment Observations Longitudinally",
       x="Year",
       y="Entrollemnt",
       color="Level")
enroll_longitudinal_plotly <- ggplotly(enroll_longitudinal)
enroll_longitudinal_plotly
# pretty_cor_table(data_combined$Year, data_combined$Enrollment, "Year", "Enrollment")

We also visualized how attendance trends changed over time across the 3 different types of schools, which can be seen in the plot below. Each color represents the variable shown in the legend, each line line represents the linear trend, and shaded regions show 95% confidence intervals. We see gradual downward trends in all three school types, indicating that attendance has slightly declined from 2014 to 2021. Among the three school types, middle schools consistently maintain the highest attendance rates followed by elementary and then high schools. The decline was steepest in elementary schools, suggesting attendance among younger students may have been more sensitive to disruptions such as policy shifts or COVID-19 pandemic.

Readers can hover over the plots to see specific data point information, zoom in or out, or double click a variable on the legend to see only the data for the variable.

data_combined <- read_csv("data_combined_attend.csv", show_col_types = FALSE) %>%
  mutate(
    Level = factor(Level, levels = c("Elementary School", "Middle School", "High School"))
  )

attend_longitudinal <- ggplot(data_combined, aes(x=Year, y=Attendance, color=Level)) +
  geom_point(size=2) +
  geom_smooth(method=lm, size=1, alpha=0.25) +
  theme_minimal() +
  labs(title=" Attendance Observations Longitudinally",
       x="Year",
       y="Attendance",
       color="Level")
attend_longitudinal_plotly <- ggplotly(attend_longitudinal)
attend_longitudinal_plotly
# pretty_cor_table(data_combined$Year, data_combined$Attendance, "Year", "Attendance")

The following statistics table summarizes the central tendency and variability of attendance and enrollment rates across US elementary, middle, and high schools. Click the arrow in the top right corner to see the rest of the values.

For attendance, mean rates range from 96.4% to 97.7%, indicating high participation levels. Elementary schools exhibit the widest range (3.39%) and the highest variability (SD = 1.32). This suggests that there is greater fluctuation in attendance among younger students. For enrollment, mean rates range from 94.96% to 99.2%, indicating generally high enrollment rates. Here, high schools display the largest spread (range = 7.38) and the highest variance (4.79). While attendance remained uniformly high across all levels, enrollment volatility increased with school level.

summary_stats <- data_long %>%
  group_by(Type, Level) %>%
  summarise(
    n = n(),                           
    Mean = mean(Value, na.rm=TRUE),
    Median = median(Value, na.rm=TRUE),
    SD = sd(Value, na.rm=TRUE),
    Variance = var(Value, na.rm=TRUE),
    Minimum = min(Value, na.rm=TRUE),
    Maximum = max(Value, na.rm=TRUE),
    Range = Maximum - Minimum,
    Q1 = quantile(Value, 0.25, na.rm=TRUE),
    Q3 = quantile(Value, 0.75, na.rm=TRUE),
    IQR = Q3 - Q1,
  ) %>%
  arrange(Type, Level)  

paged_table(summary_stats)
# write_csv(summary_stats, "tendency_attend.csv")

#maybe add some plot showing something noteable about the tendency measures, or otherwise if there are any interesting ones we can color or bold them.

#following chunk contains attendance gap data that needs to be fixed.

# 2) Collapse duplicates within (Year, Level, Type)
#    (If the catalog has multiple series for the same concept, average them.)
# collapsed <- data_long %>%
#   group_by(Level, Year, Type) %>%
#   summarize(Value = mean(Value, na.rm = TRUE), .groups = "drop")

# 3) Wide format with both columns side-by-side
# gap_df <- collapsed %>%
#   pivot_wider(names_from = Type, values_from = Value) %>%
#   mutate(Gap = Enrollment - Attendance)

# 5) Gap over time (only where both exist)
# if (any(!is.na(gap_df$Gap))) {
#   p_gap <- ggplot(gap_df, aes(Year, Gap, color = Level)) +
#     geom_hline(yintercept = 0, linetype = "dashed") +
#     geom_line(linewidth = 1, na.rm = TRUE) +
#     geom_point(size = 1.3, alpha = 0.8) +
#     labs(title = "Enrollment – Attendance Gap (percentage points)", y = "Gap (pp)", x = "Year") +
#     theme_minimal(base_size = 12)
#   print(p_gap)
# }

# Quick diagnostics (see what exists)
# print(table(gap_df$Level, useNA = "ifany"))
# print(summary(gap_df[, c("Enrollment", "Attendance")]))
# 
# # If there are no overlapping rows, bail out gracefully
# if (nrow(drop_na(gap_df, Enrollment, Attendance)) < 2) {
#   warning("No overlapping Enrollment & Attendance observations after cleaning; skipping correlation and scatter.")
# } else {
#   # Correlation and scatter only when data exist
#   corr_df <- gap_df %>% drop_na(Enrollment, Attendance)
# 
#   r_val <- cor(corr_df$Enrollment, corr_df$Attendance, use = "pairwise.complete.obs", method = "pearson")
#   message(sprintf("Correlation between Enrollment and Attendance: r = %.2f", r_val))
# 
#   p_scatter <- ggplot(corr_df, aes(Enrollment, Attendance, color = Level)) +
#     geom_point(alpha = 0.6) +
#     geom_smooth(method = "lm", se = FALSE, linewidth = 0.9) +
#     labs(
#       title = sprintf("Attendance vs. Enrollment (r = %.2f)", r_val),
#       x = "Enrollment Rate (%)", y = "Attendance Rate (%)"
#     ) +
#     theme_minimal(base_size = 12)
#   print(p_scatter)
# }

Next, we calculate R^2 and p-values for attendance and enrollment data combined. This did not return anything significant, and given that we only have three levels (elementary, middle, high school) per year we don’t have enough statistical power to run the model filtered in that way. If we filter by school level, we have 5 observations per level. It is important to note that while n=5 is technically valid, it is very low for significance testing and therefore should be taken with a grain of salt.

Overall, we did not find any significant results so we cannot conclude that enrollment predicts attendance in this dataset.

data_combined <- read_csv("data_combined_attend.csv", show_col_types = FALSE)

model <- lm(Enrollment ~ Attendance, data=data_combined)

model_glance <- glance(model) %>%
  select(r.squared, p.value)
model_glance <- model_glance %>% mutate(Model = "Enrollment ~ Attendance")
model_glance <- model_glance %>% select(Model, everything())

pretty_r2_table(model_glance, "R² and p-value for Enrollment ~ Attendance Overall (Red = significant, Black = not significant)")
R² and p-value for Enrollment ~ Attendance Overall (Red = significant, Black = not significant)
Model r.squared p.value
Enrollment ~ Attendance 0.1097253 0.2278019
by_schooling_r2 <- data_combined %>%
    group_by(Level) %>%
    do(glance(lm(Enrollment ~ Attendance, data=.))) %>%
    select(Level,r.squared,p.value)

pretty_r2_table(by_schooling_r2, "R² and p-values by School Level (Red = significant, Black = not significant)")
R² and p-values by School Level (Red = significant, Black = not significant)
Level r.squared p.value
Elementary School 0.1355655 0.5420218
High School 0.2899115 0.3491842
Middle School 0.0677148 0.6724550

#add regression here

#regression

6. Teachers vs. Compensation — Reshape, Compare, and Quantify

This section examines how teacher workforce size and compensation have evolved in the United States from 2014 to 2021.
Using EdStats_teacher.csv, we identify indicators relating to the number of teachers and the share of educational expenditure allocated to teaching staff compensation.
After filtering out non-teaching staff and qualification percentages, the data are reshaped into tidy long form to create two new analytical variables:

Teacher counts are scaled per 10 000 to make them visually comparable to percentages.
We then:

Reading tip:
Focus on whether the compensation share rises or falls with the number of teachers.
Parallel upward trends may indicate sustained investment in instructional personnel, while divergence could suggest funding reallocation, changing class sizes, or reporting gaps across education levels.

### This is not done


# 1) Read & filter (keep only teaching staff; drop qualification % series)
teacher_raw <- read_csv("EdStats_teacher.csv", show_col_types = FALSE) %>%
  filter(!str_detect(`Indicator name`, regex("non-?teaching staff", ignore_case = TRUE))) %>%
  filter(!str_detect(`Indicator name`, regex("percentage of qualified", ignore_case = TRUE)))

# 2) Long format (auto-detect year columns) + standardized labels
teachers_long0 <- teacher_raw %>%
  pivot_longer(cols = matches("^\\d{4}$"), names_to = "Year", values_to = "Value") %>%
  mutate(
    Year = as.numeric(Year),
    Type = case_when(
      str_detect(`Indicator name`, regex("compensation", ignore_case = TRUE)) ~
        "Compensation % of Total Expenditure",
      str_detect(`Indicator name`, regex("teachers? in|number of teachers", ignore_case = TRUE)) ~
        "Number of Teachers",
      TRUE ~ NA_character_
    ),
    Level = case_when(
      str_detect(`Indicator name`, regex("\\bpre-?primary\\b", ignore_case = TRUE)) ~ "Pre-Primary",
      str_detect(`Indicator name`, regex("\\bprimary\\b", ignore_case = TRUE)) ~ "Primary",
      str_detect(`Indicator name`, regex("lower\\s*secondary", ignore_case = TRUE)) ~ "Lower Secondary",
      str_detect(`Indicator name`, regex("upper\\s*secondary", ignore_case = TRUE)) ~ "Upper Secondary",
      str_detect(`Indicator name`, regex("\\bsecondary\\b", ignore_case = TRUE)) ~ "Secondary (unspecified)",
      str_detect(`Indicator name`, regex("\\btertiary\\b|post-?secondary|higher education", ignore_case = TRUE)) ~ "Tertiary",
      TRUE ~ NA_character_
    )
  ) %>%
  filter(!is.na(Type), !is.na(Level), !is.na(Value))

# 3) Collapse duplicates within (Year, Level, Type) to ensure 1 row per cell
teachers_long <- teachers_long0 %>%
  group_by(Level, Year, Type) %>%
  summarise(Value = mean(Value, na.rm = TRUE), .groups = "drop")

# 4) Scale teacher counts to 10,000s for visual comparability with percentages
teachers_long <- teachers_long %>%
  mutate(Value_scaled = if_else(Type == "Number of Teachers", Value / 10000, Value))

# (Optional) order levels for consistent faceting
lvl_order <- c("Pre-Primary", "Primary", "Lower Secondary", "Upper Secondary",
               "Secondary (unspecified)", "Tertiary")
teachers_long <- teachers_long %>%
  mutate(Level = factor(Level, levels = intersect(lvl_order, unique(Level))))

# ---- A. Time-series trends by level (scaled) ----
p_line_teacher <- ggplot(teachers_long, aes(x = Year, y = Value_scaled, color = Type)) +
  geom_line(linewidth = 1.1, na.rm = TRUE) +
  geom_point(size = 1.7, alpha = 0.8) +
  facet_wrap(~ Level, ncol = 3, scales = "fixed") +
  theme_minimal(base_size = 12) +
  theme(strip.background = element_rect(fill = "grey95", color = NA)) +
  scale_color_manual(
    values = c("Number of Teachers" = "#FF69B4",
               "Compensation % of Total Expenditure" = "#3BBF8F"),
    labels = c("Number of Teachers (per 10,000)", "Compensation (% of total)")
  ) +
  labs(
    title = "Teachers & Compensation Trends by Level (USA)",
    y = "Scaled value (Teachers per 10,000; Compensation %)",
    color = "Series"
  ) +
  scale_x_continuous(breaks = pretty(unique(teachers_long$Year)))
p_line_teacher

# Interactive view (kept optional for HTML)
ggplotly(p_line_teacher)
# ---- B. Coverage diagnostics (counts by Type / Year / Level) ----
p_count_type <- ggplot(teachers_long, aes(x = Type)) +
  geom_bar(fill = "#c562f0") +
  theme_minimal(base_size = 12) +
  labs(title = "Coverage: Count of Observations by Type", x = "Series", y = "Count")
p_count_type

p_count_year <- ggplot(teachers_long, aes(x = factor(Year))) +
  geom_bar(fill = "#c562f0") +
  theme_minimal(base_size = 12) +
  labs(title = "Coverage: Observations per Year (all levels)", x = "Year", y = "Count")
p_count_year

p_count_level <- ggplot(teachers_long, aes(x = Level, fill = Type)) +
  geom_bar(position = "dodge") +
  theme_minimal(base_size = 12) +
  labs(title = "Coverage: Observations by Level and Type", x = "Level", y = "Count", fill = "Series")
p_count_level

# ---- C. Correlation between staffing and compensation (where both exist) ----
# Wide table on the *scaled* values so axes are comparable
wide_tc <- teachers_long %>%
  select(Level, Year, Type, Value_scaled) %>%
  pivot_wider(names_from = Type, values_from = Value_scaled)

# overall (pooled) correlation with guard
corr_tbl <- tibble::tibble()

valid_overall <- wide_tc %>%
  mutate(valid = complete.cases(`Number of Teachers`, `Compensation % of Total Expenditure`)) %>%
  filter(valid)

if (nrow(valid_overall) >= 3) {
  overall_r <- cor(valid_overall$`Number of Teachers`,
                   valid_overall$`Compensation % of Total Expenditure`,
                   use = "complete.obs", method = "pearson")
  corr_tbl <- bind_rows(corr_tbl, tibble(Level = "All levels (pooled)", r = overall_r))
}

# by-level correlations (no cur_data(); compute within each group)
by_level_r <- wide_tc %>%
  group_by(Level) %>%
  summarise(
    n_pairs = sum(complete.cases(`Number of Teachers`, `Compensation % of Total Expenditure`)),
    r = ifelse(
      n_pairs >= 3,
      cor(`Number of Teachers`, `Compensation % of Total Expenditure`,
          use = "complete.obs", method = "pearson"),
      NA_real_
    ),
    .groups = "drop"
  ) %>%
  filter(!is.na(r))

corr_tbl <- bind_rows(corr_tbl, by_level_r %>% select(Level, r))
corr_tbl
## # A tibble: 6 × 2
##   Level                        r
##   <chr>                    <dbl>
## 1 All levels (pooled)     -0.313
## 2 Pre-Primary             -0.571
## 3 Primary                  0.551
## 4 Lower Secondary         -0.830
## 5 Upper Secondary         -0.862
## 6 Secondary (unspecified) -0.334
# ---- D. Summary statistics (per Type × Level) ----
summary_stats_teachers <- teachers_long %>%
  group_by(Type, Level) %>%
  summarise(
    n = n(),
    mean_value = mean(Value, na.rm = TRUE),
    median_value = median(Value, na.rm = TRUE),
    sd_value = sd(Value, na.rm = TRUE),
    min_value = min(Value, na.rm = TRUE),
    max_value = max(Value, na.rm = TRUE),
    IQR = IQR(Value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(Type, Level)

write_csv(summary_stats_teachers, "tendency_teachers.csv")
summary_stats_teachers
## # A tibble: 12 × 9
##    Type         Level     n mean_value median_value sd_value min_value max_value
##    <chr>        <fct> <int>      <dbl>        <dbl>    <dbl>     <dbl>     <dbl>
##  1 Compensatio… Pre-…     6       49.0         49.0  6.83e-1      48.1      49.8
##  2 Compensatio… Prim…     6       49.0         49.0  6.83e-1      48.1      49.8
##  3 Compensatio… Lowe…     6       49.0         49.0  6.83e-1      48.1      49.8
##  4 Compensatio… Uppe…     6       49.0         49.0  6.83e-1      48.1      49.8
##  5 Compensatio… Seco…     6       46.8         46.7  9.30e-1      46.1      48.7
##  6 Compensatio… Tert…     6       28.5         28.3  7.90e-1      27.6      30.0
##  7 Number of T… Pre-…    12   531226.      608720.   1.40e+5  302255    651997. 
##  8 Number of T… Prim…    17  1604827.     1687937    1.49e+5 1414000   1774348. 
##  9 Number of T… Lowe…    14   810580.      858595.   8.81e+4  670172    894725. 
## 10 Number of T… Uppe…    14   774670.      813548.   7.75e+4  650603    848440. 
## 11 Number of T… Seco…    15  1501112.     1661375    2.95e+5  784414.  1737206. 
## 12 Number of T… Tert…    22   853184.      747500    3.14e+5  574000   1581424  
## # ℹ 1 more variable: IQR <dbl>

7. Discussion

Interim Takeaways

  • Enrollment vs. Attendance: Attendance rates consistently trail enrollment rates at all levels, though the gap varies by level and year. This may point to structural barriers beyond access.
  • Teachers vs. Compensation: Preliminary trends suggest whether compensation shares grow with staffing. Divergences could signal reallocations or coverage differences.

These observations are preliminary and will guide further analysis.

Reproducibility Notes
All intermediate datasets and figures are written to disk (EdStats_USA.csv, EdStats_attend.csv, EdStats_teacher.csv, and exported PNGs).
This allows others to review and re-run later steps without repeating earlier processing.
Consider adding sessionInfo() at the end to document package versions.


8. Works Cited

reference string

reference string

reference string